Imagine you have a dataset with no variable to predict. When a dataset has no target variable, the machine has to learn in an unsupervised way: learning is based on a measure of similarity or distance between the observations in the dataset. The most commonly used technique in unsupervised learning is clustering.
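To make "distance between observations" concrete, here is a small sketch using base R's `dist()`. The three points `a`, `b` and `c` are made up for illustration and do not come from any dataset used in this article.

```r
# Three made-up observations with two numeric features each
obs <- rbind(a = c(1, 2), b = c(4, 6), c = c(1, 3))

# Pairwise Euclidean distances between the rows
d <- dist(obs, method = "euclidean")
d

# a and c are close (distance 1); a and b are far apart (distance 5)
as.matrix(d)["a", c("b", "c")]
```

A clustering algorithm would tend to put `a` and `c` in the same group, because they are much closer to each other than either is to `b`.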
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). In this article, we will discuss one of the most common clustering algorithms, k-means.
K-means is a centroid-based clustering algorithm that follows a simple procedure to partition a given dataset into a pre-determined number of clusters, denoted “k”. We will walk through one use case for the k-means algorithm.
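Before turning to the use case, here is a minimal sketch of what "a pre-determined number of clusters" means in practice, using base R's `kmeans()`. The data is simulated and the names `toy` and `toy_km` are hypothetical; the `nstart` argument is an addition here that reruns the algorithm from several random starting centroids and keeps the best result.

```r
# Simulated data: two well-separated 2-D blobs of 50 points each
set.seed(42)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# k is fixed up front (here k = 2)
toy_km <- kmeans(toy, centers = 2, nstart = 10)

toy_km$centers         # one centroid (row) per cluster
table(toy_km$cluster)  # number of observations assigned to each cluster
```

Each observation is assigned to the cluster whose centroid is nearest, and each centroid is the mean of the observations assigned to it.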
The use case discussed here is fraud analysis in the mobile financial industry.
PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that provides the mobile financial service, which currently runs in more than 14 countries around the world. The dataset is downloaded from Kaggle.
## step type amount nameOrig
## Min. :387.0 Length:2 Min. :0 Length:2
## 1st Qu.:472.8 Class :character 1st Qu.:0 Class :character
## Median :558.5 Mode :character Median :0 Mode :character
## Mean :558.5 Mean :0
## 3rd Qu.:644.2 3rd Qu.:0
## Max. :730.0 Max. :0
## oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## Min. :0 Min. :0 Length:2 Min. :1008610
## 1st Qu.:0 1st Qu.:0 Class :character 1st Qu.:2749149
## Median :0 Median :0 Mode :character Median :4489688
## Mean :0 Mean :0 Mean :4489688
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:6230228
## Max. :0 Max. :0 Max. :7970767
## newbalanceDest Fraud FlaggedFraud
## Min. :1008610 Length:2 Length:2
## 1st Qu.:2749149 Class :character Class :character
## Median :4489688 Mode :character Mode :character
## Mean :4489688
## 3rd Qu.:6230228
## Max. :7970767
## step type amount nameOrig
## Min. : 1.0 Length:636262 Min. : 0 Length:636262
## 1st Qu.:156.0 Class :character 1st Qu.: 13385 Class :character
## Median :239.0 Mode :character Median : 74692 Mode :character
## Mean :243.5 Mean : 179872
## 3rd Qu.:334.0 3rd Qu.: 208882
## Max. :743.0 Max. :56808983
## oldbalanceOrg newbalanceOrig nameDest
## Min. : 0 Min. : 0 Length:636262
## 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 14075 Median : 0 Mode :character
## Mean : 831228 Mean : 852353
## 3rd Qu.: 107190 3rd Qu.: 143775
## Max. :59585040 Max. :49585040
## oldbalanceDest newbalanceDest Fraud
## Min. : 0 Min. : 0 Length:636262
## 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 132805 Median : 215099 Mode :character
## Mean : 1101823 Mean : 1226364
## 3rd Qu.: 941260 3rd Qu.: 1112335
## Max. :355185537 Max. :355380484
## FlaggedFraud
## Length:636262
## Class :character
## Mode :character
##
##
##
##
## FALSE TRUE
## 118028 833
Variables Description:
step = maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).
type = CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
amount = amount of the transaction in local currency.
nameOrig = customer who started the transaction.
oldbalanceOrg = initial balance before the transaction.
newbalanceOrig = new balance after the transaction.
nameDest = customer who is the recipient of the transaction.
oldbalanceDest = initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants).
newbalanceDest = new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants).
Fraud = transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.
FlaggedFraud = the business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.

paysim_num <- paysim %>%
  select_if(is.numeric) %>%
  scale()
paysim_num <- as.data.frame(paysim_num) %>%
  select(-c(newbalanceOrig, oldbalanceDest))

wss <- function(data, maxCluster = 9) {
  # SSw[1]: total sum of squares, i.e. the within sum of squares for a single cluster
  SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
  # SSw[i]: total within-cluster sum of squares for i = 2..maxCluster clusters
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch = 19)
}

The elbow curve suggests that four clusters are enough to explain most of the variance in the data. Beyond that point, adding more clusters does little to explain the groups further (WCSS levels off after four).
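Because `kmeans()` starts from random centroids, the elbow can shift slightly between runs. The sketch below shows one way to inspect the curve numerically rather than visually. It is an illustration under stated assumptions: the data is simulated with three groups, the names `sim` and `wcss` are hypothetical, and `set.seed()` and `nstart` are additions that are not part of the function above.

```r
set.seed(123)

# Simulated data with three groups of 100 points each (illustration only)
sim <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 6), ncol = 2),
  cbind(rnorm(100, mean = 0), rnorm(100, mean = 6))
)
sim <- scale(sim)

# WCSS for k = 1 is the total sum of squares; for k >= 2 it comes from kmeans()
wcss <- (nrow(sim) - 1) * sum(apply(sim, 2, var))
for (k in 2:8) {
  wcss[k] <- kmeans(sim, centers = k, nstart = 20)$tot.withinss
}

# Relative drop in WCSS when moving from k to k + 1 clusters
round(-diff(wcss) / head(wcss, -1), 2)
```

The elbow is the value of k after which these relative drops become small. On this simulated data the large drops are expected to stop after k = 3, since that is how many groups were generated.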
paysim %>%
  select(-c(type, nameOrig, nameDest, Fraud, FlaggedFraud)) %>%
  group_by(cluster) %>%
  summarise_all("mean")

paysim %>%
  select(-c(type, nameOrig, nameDest, FlaggedFraud)) %>%
  group_by(Fraud) %>%
  summarise_all("mean")

##
## 1 2 3 4 5 6
## No 53 4891 5173 107544 367 0
## Yes 1 263 9 481 0 79
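The table above cross-tabulates the Fraud label (rows) against the assigned cluster (columns). The step that fits the model and attaches the cluster label to the data is not shown in this article, so the following is only a sketch of how such a table could be produced. It uses simulated data, and the names `toy`, `toy_km` and `tab` are hypothetical.

```r
set.seed(1)

# Simulated transactions: 95 ordinary amounts and 5 unusually large ones
toy <- data.frame(
  amount = c(rnorm(95, mean = 100, sd = 10), rnorm(5, mean = 1000, sd = 50)),
  Fraud  = c(rep("No", 95), rep("Yes", 5))
)

# Fit k-means on the scaled numeric column and attach the cluster label
toy_km <- kmeans(scale(toy$amount), centers = 2, nstart = 10)
toy$cluster <- toy_km$cluster

# Rows: fraud label, columns: cluster
tab <- table(toy$Fraud, toy$cluster)
tab

# Share of each cluster that is labelled as fraud
prop.table(tab, margin = 2)
```

Reading the table column by column shows whether any cluster is dominated by fraudulent transactions, which is the pattern to look for in the PaySim table.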
If the data is high dimensional, apply PCA first and then K-Means; this needs to be emphasized as well.
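A minimal sketch of that PCA-then-K-Means workflow, using base R's `prcomp()` and `kmeans()`, might look like the following. The data is simulated, the names `x`, `pca`, `scores` and `pc_km` are hypothetical, and the 80% cumulative-variance cutoff is an assumption chosen for illustration rather than a rule from this article.

```r
set.seed(7)

# Simulated high-dimensional data: 200 rows, 20 columns, two groups
x <- rbind(
  matrix(rnorm(100 * 20, mean = 0), ncol = 20),
  matrix(rnorm(100 * 20, mean = 3), ncol = 20)
)

# PCA on centered and scaled columns
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Keep the fewest components that explain at least 80% of the variance
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pc <- which(cum_var >= 0.80)[1]
scores <- pca$x[, seq_len(n_pc), drop = FALSE]

# Cluster the observations in the reduced space
pc_km <- kmeans(scores, centers = 2, nstart = 10)
table(pc_km$cluster)
```

Clustering on the component scores instead of the raw columns reduces the number of dimensions that enter the distance calculation.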
The data is downloaded from Kaggle.
## Observations: 200
## Variables: 5
## $ CustomerID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ Gender <fct> Male, Male, Female, Female, Female, Fem...
## $ Age <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30,...
## $ Annual.Income..k.. <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19,...
## $ Spending.Score..1.100. <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14...
CustomerID = Unique ID assigned to the customer
Gender = Gender of the customer
Age = Age of the customer
Annual Income (k$) = Annual income of the customer
Spending Score (1-100) = Score assigned by the mall based on customer behavior and spending nature

mall$cluster <- mall_km$cluster
mall %>%
  select(-c(CustomerID, Gender)) %>%
  group_by(cluster) %>%
  summarise_all("mean")

ins <- read.csv("ins_subset.csv")
ins_subset <- ins %>%
  select(witnesses, bodily_injuries, total_claim_amount, number_of_vehicles_involved, capital.gains)

##
## N Y
## 1 157 55
## 2 168 16
## 3 166 64
## 4 128 59
## 5 134 53